The effect of prediction error on episodic memory encoding is modulated by the outcome of the predictions

Expectations can lead to prediction errors of varying degrees depending on the extent to which the information encountered in the environment conforms with prior knowledge. While there is strong evidence for the computationally specific effects of such prediction errors on learning, less evidence is available regarding their effects on episodic memory. Here, we had participants work on a task in which they learned context/object-category associations of different strengths based on the outcomes of their predictions. We then used a reinforcement learning model to derive subject-specific trial-to-trial estimates of prediction error at encoding and link them to subsequent recognition memory. Results showed that model-derived prediction errors at encoding influenced subsequent memory as a function of the outcome of participants’ predictions (correct vs. incorrect). When participants correctly predicted the object category, stronger prediction errors (as a consequence of weak expectations) led to enhanced memory. In contrast, when participants incorrectly predicted the object category, stronger prediction errors (as a consequence of strong expectations) led to impaired memory. These results highlight the important moderating role of choice outcome, which may be related to interactions between the hippocampal and striatal dopaminergic systems.

In Experiment 1, there was a significant interaction between prediction outcome and contingency condition, χ²(1) = 9.40, p = .009. To break down the interaction, recognition memory was analyzed separately for incorrect and correct predictions. For correct predictions, the difference between the contingency conditions 0.33 and 0.80 was significant, β = 0.86, p_corr < .001, OR = 2.36, showing that the items presented when participants correctly predicted the 0.33 category were remembered better than the items presented when participants correctly predicted the 0.80 category. In addition, there was a trend toward better memory for items presented when participants correctly predicted the 0.33 category than for items presented when they correctly predicted the 0.20 category, β = 1.09, p_corr = .056, OR = 2.99. The other pairwise comparison did not reach significance, p_corr = 1. For incorrect predictions, memory tended to be worse for higher predicted contingency conditions, but this did not reach significance, χ²(1) = 4.00, p = .133. In Experiment 2, there was a main effect of contingency condition, χ²(2) = 13.56, p = .009, while the main effect of prediction accuracy and the interaction between prediction outcome and predicted contingency were not significant, ps_corr > .113. Pairwise comparisons showed that predicting the contingency condition 0.50 led to better recognition of the items than predicting the contingency condition 0.90, β = 0.34, p_corr = .010, OR = 1.41. Predicting the contingency condition 0.70 also led to better item memory than predicting the contingency condition 0.90, β = 0.27, p_corr = .007, OR = 1.31. The other comparisons did not reach significance, ps_corr > .170.
Taken together, these results show that the contingency condition of the predicted category affected the likelihood of remembering the object presented, and that this effect was partially modulated by whether the prediction was correct or incorrect. Specifically, for incorrect predictions, predicting an object category that belonged to a higher contingency condition tended to be detrimental to memory. Incorrectly predicting a category from a higher contingency condition is likely to generate a higher aggregated PE, because participants' expectations were stronger and they therefore tended to predict the most likely category.
For correct predictions, predicting categories belonging to lower contingency conditions (0.33 contingency condition in Experiment 1, 0.5 and 0.7 contingency conditions in Experiment 2) tended to lead to better memory. These conditions are likely to generate higher aggregated PE, as participants' expectations were weaker.

Dirichlet-Multinomial Model
We used the Dirichlet-multinomial model to formalize learning as optimal Bayesian updating of the Dirichlet distribution (1). We apply the Multinomial distribution because the Categorical distribution that we use in our task is a special case of the Multinomial distribution where only one outcome is sampled on each trial. That is, the outcomes $x$ of the task were drawn from a Multinomial distribution

$$\mathrm{Mu}(x \mid 1, \theta) =: \mathrm{Cat}(x \mid \theta) \tag{S1}$$

where $p(x = j \mid \theta) = \theta_j$. For example, in the weak prior condition, $\theta = [0.33, 0.33, 0.33]$.
The Dirichlet distribution is the conjugate prior of the Multinomial distribution and can therefore be used as a prior. This distribution is parameterized by the concentration parameters $\alpha = (\alpha_1, \alpha_2, ..., \alpha_J)$, which are often called pseudo-counts.
We use the Dirichlet distribution to model the participants' prior expectations (i.e., at the beginning of the learning phase) about the category probabilities. Here we assume that participants start the task with a flat prior that reflects that all categories are equally likely, which corresponds to $\alpha = (1, 1, ..., 1)$.
These values thus indicate the assumption that each category has been pseudo-counted once.
To obtain the posterior, the only operation required is adding the observed data to the prior. To obtain an estimate of the category probabilities, we can compute the mode of the posterior, referred to as the maximum a posteriori (MAP) estimate:

$$\hat{\theta}_j = \frac{T_j + \alpha_j - 1}{T + \sum_{k=1}^{J} \alpha_k - J} \tag{S2}$$

where $T_j$ denotes how many times category $j$ was presented and $T$ refers to the total number of trials. Under the assumption that $\alpha = (1, 1, ..., 1)$, the MAP estimate is equal to the maximum likelihood (ML) estimate that is based on the empirically observed frequency of the categories:

$$\hat{\theta}_j = \frac{T_j}{T} \tag{S3}$$
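The conjugate update described above can be illustrated with a minimal sketch. It assumes a flat prior with one pseudo-count per category; the function names and the example outcome sequence are hypothetical and are not taken from the original analysis code. The MAP estimate is computed here as the mode of the Dirichlet posterior.

```python
import numpy as np

def posterior_alpha(prior_alpha, outcomes, n_categories):
    """Dirichlet posterior parameters: prior pseudo-counts plus observed counts."""
    counts = np.bincount(outcomes, minlength=n_categories)
    return np.asarray(prior_alpha, dtype=float) + counts

def map_estimate(alpha):
    """Mode of a Dirichlet distribution with concentration parameters alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return (alpha - 1.0) / (alpha.sum() - alpha.size)

# Hypothetical sequence of presented categories (coded 0, 1, 2).
outcomes = np.array([0, 0, 1, 0, 2, 0, 0, 1])
n_categories = 3

alpha_post = posterior_alpha(np.ones(n_categories), outcomes, n_categories)
theta_map = map_estimate(alpha_post)

# ML estimate: empirically observed category frequencies T_j / T.
theta_ml = np.bincount(outcomes, minlength=n_categories) / outcomes.size
```

With the flat prior, `theta_map` and `theta_ml` coincide, because the single pseudo-count per category cancels in the mode.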

Delta-Rule Formulation
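We now show how eq. (S3) can be translated into the delta rule. Let $\hat{\theta}_{t,j}$ denote the estimate of the $j$th category probability on trial $t$, and let $x_{t,j}$ equal 1 if category $j$ was presented on trial $t$ and 0 otherwise. Then the estimate of the $j$th category on trial $t + 1$, denoted by $\hat{\theta}_{t+1,j}$, can be computed according to

$$\hat{\theta}_{t+1,j} = \hat{\theta}_{t,j} + \alpha_t \, \delta_{t,j}$$

where $\delta_{t,j} := (x_{t,j} - \hat{\theta}_{t,j})$ corresponds to the prediction error and $\alpha_t := \frac{1}{t}$ is the learning rate (2).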
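As a numerical check of this equivalence, the following sketch applies the delta rule with a learning rate of 1/t and compares the result with the running category frequencies. The function name and the outcome sequence are hypothetical, and the uniform starting estimate is an assumption; it has no influence on the result because the learning rate equals 1 on the first trial.

```python
import numpy as np

def delta_rule_estimates(outcomes, n_categories):
    """Track category-probability estimates with a 1/t learning rate."""
    one_hot = np.eye(n_categories)[outcomes]             # x_{t,j}
    theta = np.full(n_categories, 1.0 / n_categories)    # assumed starting estimate
    history = []
    for t, x_t in enumerate(one_hot, start=1):
        delta = x_t - theta                  # prediction error for each category
        theta = theta + (1.0 / t) * delta    # decreasing learning rate alpha_t = 1/t
        history.append(theta.copy())
    return np.array(history)

outcomes = np.array([0, 0, 1, 0, 2, 0, 0, 1])
estimates = delta_rule_estimates(outcomes, 3)

# Running relative frequencies T_j / T after each trial.
trial_numbers = np.arange(1, len(outcomes) + 1)[:, None]
running_freq = np.cumsum(np.eye(3)[outcomes], axis=0) / trial_numbers
```

After every trial, `estimates` matches `running_freq`, which is the ML estimate computed on the trials seen so far.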

Parameter Recovery
To check whether the models could successfully recover the parameters, synthetic data were first simulated with known parameters. Next, the models were fit to the simulated data and the best-fitting parameters were estimated. Finally, the recovered parameters were compared with the known parameters. Graphs showing the results of parameter recovery are presented in Figures S2 and S3. A high correlation between the simulated and fitted parameters indicates successful recovery. Note that the inverse temperature parameter was constrained between 0 and 10, as previous parameter recovery attempts showed that recovery was not reliable for values above that range.
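The simulate, fit, and compare logic of this procedure can be sketched as follows. This is only an illustration under stated assumptions: it uses a generic delta-rule learner with a softmax choice rule and a coarse grid search, not the models or the fitting routine used in the study. All function names, the task probabilities, the number of simulated agents, and the grids are hypothetical; only the 0 to 10 range for the inverse temperature follows the text.

```python
import numpy as np

def softmax(values, beta):
    """Choice probabilities with inverse temperature beta."""
    z = np.exp(beta * (values - values.max()))
    return z / z.sum()

def simulate(alpha, beta, outcomes, n_cat, rng):
    """Simulate predictions of a delta-rule learner with a fixed learning rate."""
    eye = np.eye(n_cat)
    theta = np.full(n_cat, 1.0 / n_cat)
    choices = np.empty(len(outcomes), dtype=int)
    for t, outcome in enumerate(outcomes):
        choices[t] = rng.choice(n_cat, p=softmax(theta, beta))
        theta = theta + alpha * (eye[outcome] - theta)
    return choices

def neg_log_lik(alpha, beta, choices, outcomes, n_cat):
    """Negative log-likelihood of the observed choices under (alpha, beta)."""
    eye = np.eye(n_cat)
    theta = np.full(n_cat, 1.0 / n_cat)
    nll = 0.0
    for choice, outcome in zip(choices, outcomes):
        nll -= np.log(softmax(theta, beta)[choice])
        theta = theta + alpha * (eye[outcome] - theta)
    return nll

def fit_grid(choices, outcomes, n_cat, alphas, betas):
    """Return the (alpha, beta) grid point with the lowest negative log-likelihood."""
    best = min((neg_log_lik(a, b, choices, outcomes, n_cat), a, b)
               for a in alphas for b in betas)
    return best[1], best[2]

rng = np.random.default_rng(1)
n_cat, n_trials, n_agents = 3, 150, 10
alphas = np.linspace(0.05, 0.95, 8)     # hypothetical learning-rate grid
betas = np.linspace(0.0, 10.0, 11)      # inverse temperature constrained to [0, 10]

true_params, fitted_params = [], []
for _ in range(n_agents):
    a, b = rng.uniform(0.05, 0.95), rng.uniform(0.0, 10.0)
    outcomes = rng.choice(n_cat, size=n_trials, p=[0.8, 0.1, 0.1])
    choices = simulate(a, b, outcomes, n_cat, rng)
    true_params.append((a, b))
    fitted_params.append(fit_grid(choices, outcomes, n_cat, alphas, betas))

true_params = np.array(true_params)
fitted_params = np.array(fitted_params)
# Correlation between known and recovered values for each parameter.
r_alpha = np.corrcoef(true_params[:, 0], fitted_params[:, 0])[0, 1]
r_beta = np.corrcoef(true_params[:, 1], fitted_params[:, 1])[0, 1]
```

A grid search is used here only to keep the sketch free of optimizer dependencies; a gradient-based or bounded optimizer would normally replace it.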

Model Recovery
To assess whether the fitting procedure could successfully distinguish between the different models, a model recovery procedure was used. Data were simulated from each of the three models, and every model was then fit to each simulated data set to determine which model fit best. This procedure was repeated 100 times. The confusion matrices in Figure S4 show the results of this procedure.
Each cell represents the probability that data simulated by the model on the X axis are best fit by the model on the Y axis. Higher probabilities on the diagonal mean that the procedure can successfully recover the models from which the data were generated.
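A confusion matrix of this kind can be assembled from two index vectors: the model that generated each simulated data set and the model that fit it best. The sketch below is illustrative only; the function name and the example indices are hypothetical, and the model-fitting step that produces the best-fit indices is assumed to have been run already.

```python
import numpy as np

def confusion_matrix(simulated_idx, best_fit_idx, n_models):
    """Rows: best-fitting model (Y axis). Columns: simulating model (X axis)."""
    counts = np.zeros((n_models, n_models))
    np.add.at(counts, (best_fit_idx, simulated_idx), 1)
    # Normalize each column so it gives p(best fit | simulating model).
    return counts / counts.sum(axis=0, keepdims=True)

# Hypothetical outcome of 4 repetitions for each of 3 models.
simulated_idx = np.repeat([0, 1, 2], 4)
best_fit_idx = np.array([0, 0, 0, 1,   1, 1, 1, 1,   2, 2, 0, 2])

matrix = confusion_matrix(simulated_idx, best_fit_idx, 3)
```

In this toy example the second model is always recovered, whereas the first and third are recovered on three of four repetitions, so the diagonal reads 0.75, 1.0, 0.75.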

Simulated and Empirical Comparison for dLRI, dfLRI, and fLRE Models
Figure S5 shows comparisons between simulated and actual data for the instructive model with a decreasing learning rate (dLRI), the instructive model with a decreasing learning rate that was free to vary between participants (dfLRI), and the evaluative model with a free learning rate (fLRE).

Learning Rate Validation
To check whether the model captured participants' differences in learning rate, we compared cumulative accuracy for simulated and empirical data at different values of learning rate α.
For simulated and empirical data, quartiles for α were calculated (zeroth, first, second, third, and fourth quartile), and cumulative accuracy was aggregated over the data points falling between each quartile and the previous one. Cumulative accuracy for the four resulting bins is shown in Figure S6, as a function of type of data (empirical vs. simulated) and Experiment (1 vs. 2). In both Experiments, the simulated data mirrored the pattern of the actual data, showing that the model was capable of capturing the observed effects of individual differences in learning rate on cumulative accuracy.

Analysis with Binned Hit Rate
In order to compare memory at different levels of the model-derived PE, we calculated the quartiles for PE for each participant, separately for trials with correct and incorrect prediction outcome. We binned the hit rate by aggregating it between the quartiles, to create four bins which eventually were used as the explanatory variable in our analysis. A graph with the distribution of PE by binned data is shown in Figure S7.
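The binning step can be sketched as follows for the trials of a single participant within one prediction outcome (correct or incorrect). The function name and the toy data are hypothetical, and the handling of values that fall exactly on a quartile boundary (assigned to the upper bin) is an assumption rather than a documented choice of the original analysis.

```python
import numpy as np

def binned_hit_rate(pe, hit, n_bins=4):
    """Mean hit rate within quantile bins of prediction error (PE)."""
    pe = np.asarray(pe, dtype=float)
    hit = np.asarray(hit, dtype=float)
    edges = np.quantile(pe, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only, so that bin indices run from 0 to n_bins - 1.
    bins = np.digitize(pe, edges[1:-1])
    return np.array([hit[bins == b].mean() for b in range(n_bins)])

# Hypothetical trials: model-derived PE and recognition outcome (1 = hit).
pe = np.arange(8) / 10
hit = np.array([1, 0, 1, 1, 0, 0, 1, 1])

rates = binned_hit_rate(pe, hit)
```

Here each bin contains two trials and `rates` holds one hit rate per quartile bin. Applying the function separately per participant and prediction outcome yields the binned values that enter the mixed-effects model.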
Hit rate as a function of binned PE and prediction outcome is shown in Figure S8. We then tested for the three-way interaction between binned PE, prediction outcome, and experiment in a linear mixed-effects model with participants as random effects. The three-way interaction was not significant, χ²(3) = 4.23, p = .238. In addition, the interaction between PE and experiment and the interaction between prediction outcome and experiment were not significant (χ²(3) = 1.68, p = .642, and χ²(3) = 0.81, p = .368, respectively). These results suggest that neither the effect of PE nor the interaction between PE and prediction outcome differed significantly between the two experiments. By contrast, there was a main effect of Experiment, χ²(3) = 7.70, p = .005, showing that participants' overall performance was significantly worse in Experiment 2 than in Experiment 1. Importantly, there was also a significant interaction between PE and prediction outcome, χ²(3) = 14.09, p = .003. To break down the interaction, the effect of PE on recognition was analyzed separately for correct and incorrect prediction outcomes. We compared each bin with the first one to test whether increasingly higher PE significantly affected memory encoding. Results showed that for incorrect prediction outcomes, the difference between the first and the second bin was significant (β = -0.0557, p_corr = .040, OR = 1.06). In addition, the comparisons between the third and the first bin, and between the fourth and the first bin, were both significant (p_corr < .001). For correct prediction outcomes, the comparisons between the first and second bin and between the first and third bin did not reach significance (ps_corr > .114), whereas the comparison between the fourth and the first bin was significant (β = 0.081, p_corr = .009, OR = 1.08).
These results suggest that while a PE generated by an incorrect prediction impairs memory even when prior expectations are not very strong, correct predictions require a higher PE before benefits for memory encoding can be observed.

Analysis of Choice-dependent PE and Memory
In addition to considering PE as depending on the category presented in each trial, we also analyzed the effects on memory of PE that depends on participants' choice. Such PE is similar to the signed PE considered by previous studies (e.g., 3; 4) and thus allows for a direct comparison between the current study and the previous ones. A graph of the relationship between choice-dependent PE and recognition accuracy is presented in Figure S9.

Analysis of Memory as a Function of Subsequent Similar Scenes
Following suggestions from a reviewer, we tested the possibility that the number of subsequent similar scenes that followed each encoded image could modulate encoding strength.
The rationale is that high PE can be beneficial in the case of successful predictions because it can be useful for predicting similar subsequent trials. The relationship between the number of subsequent similar contexts, recognition memory, and prediction outcome is shown in Figure S10. This analysis showed that the main effects of number of subsequent scenes and prediction outcome, as well as their interaction, were not significant, ps > .097.